Content based recommendation system for movies [Baby Version]

Tutorials / Implementations
NLP
Develop a content-based recommendation system for movies.
Published

August 15, 2021

from google.colab import drive
drive.mount('/content/drive')
Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).

Overview about recommendation system and its application

Recommendation systems are popular nowadays. They are used to predict the “rating” or “preference” that a user would give to an item, and this information can be used to provide users with useful suggestions. For example, Amazon uses them to suggest products to customers, while Netflix uses them to recommend videos based on each user’s preferences.

Main types of recommendation system

Generally, there are three types of recommendation system:

1. Simple recommenders: provide recommendations based on item popularity or ratings, for example, the movies in the IMDb Top 250.
2. Content-based recommenders: suggest items based on item properties. The system assumes that if a person likes a particular item, he or she will also like similar items. For example, Netflix suggests new movies based on the user’s history.
3. Collaborative filtering engines: predict the rating or preference that a user would give an item based on the past ratings and preferences of other users.

In this post, we will build a content-based recommendation system for movies using the MovieLens Dataset. Since the full dataset is large (26 million ratings and 750,000 tag applications), we only use a subset of it for fast development.

Load dataset

You can download the dataset here.

import pandas as pd
metadata = pd.read_csv("/content/drive/MyDrive/Colab Notebooks/dataset/archive/movies_metadata.csv", low_memory=False)
metadata.head(3)
adult belongs_to_collection budget genres homepage id imdb_id original_language original_title overview popularity poster_path production_companies production_countries release_date revenue runtime spoken_languages status tagline title video vote_average vote_count
0 False {'id': 10194, 'name': 'Toy Story Collection', ... 30000000 [{'id': 16, 'name': 'Animation'}, {'id': 35, '... http://toystory.disney.com/toy-story 862 tt0114709 en Toy Story Led by Woody, Andy's toys live happily in his ... 21.946943 /rhIRbceoE9lR4veEXuwCC2wARtG.jpg [{'name': 'Pixar Animation Studios', 'id': 3}] [{'iso_3166_1': 'US', 'name': 'United States o... 1995-10-30 373554033.0 81.0 [{'iso_639_1': 'en', 'name': 'English'}] Released NaN Toy Story False 7.7 5415.0
1 False NaN 65000000 [{'id': 12, 'name': 'Adventure'}, {'id': 14, '... NaN 8844 tt0113497 en Jumanji When siblings Judy and Peter discover an encha... 17.015539 /vzmL6fP7aPKNKPRTFnZmiUfciyV.jpg [{'name': 'TriStar Pictures', 'id': 559}, {'na... [{'iso_3166_1': 'US', 'name': 'United States o... 1995-12-15 262797249.0 104.0 [{'iso_639_1': 'en', 'name': 'English'}, {'iso... Released Roll the dice and unleash the excitement! Jumanji False 6.9 2413.0
2 False {'id': 119050, 'name': 'Grumpy Old Men Collect... 0 [{'id': 10749, 'name': 'Romance'}, {'id': 35, ... NaN 15602 tt0113228 en Grumpier Old Men A family wedding reignites the ancient feud be... 11.7129 /6ksm1sjKMFLbO7UY2i6G1ju9SML.jpg [{'name': 'Warner Bros.', 'id': 6194}, {'name'... [{'iso_3166_1': 'US', 'name': 'United States o... 1995-12-22 0.0 101.0 [{'iso_639_1': 'en', 'name': 'English'}] Released Still Yelling. Still Fighting. Still Ready for... Grumpier Old Men False 6.5 92.0

Our recommendation system will be based on the similarity between the movie overviews. Specifically, we will compute the pairwise cosine similarity scores for all movies and suggest the movies based on this score.

First of all, we have to transform the raw text into vector form, since we cannot compute the similarity score directly from the raw text. In this post, we will compute a Term Frequency-Inverse Document Frequency (TF-IDF) vector for each document.

from sklearn.feature_extraction.text import TfidfVectorizer

# Create a TF-IDF object and remove all english stop words in the document 
# before producing vector representation
tfidf = TfidfVectorizer(stop_words='english')
metadata['overview'] = metadata['overview'].fillna('')

tfidf_matrix = tfidf.fit_transform(metadata['overview'])
tfidf_matrix.shape
(45466, 75827)

From the shape of the matrix, we can see that each vector has a length of 75827 and that there are 45466 movie overviews in total.

tfidf.get_feature_names()[5000:5010]  # in scikit-learn >= 1.0, use get_feature_names_out()
['avails',
 'avaks',
 'avalanche',
 'avalanches',
 'avallone',
 'avalon',
 'avant',
 'avanthika',
 'avanti',
 'avaracious']

After generating a vector for each movie overview, we can start computing the similarity scores between them. There are many ways to do that besides cosine similarity, such as Manhattan distance, Euclidean distance, Pearson correlation, etc. There is no right or wrong answer as to which score is best: different scores work well in different situations, and it is always encouraged to experiment with different metrics and choose the best one.
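To make the difference between these metrics concrete, here is a small sketch on two hypothetical three-dimensional document vectors (toy values, not rows of our TF-IDF matrix). The vectors point in the same direction, so their cosine similarity is maximal even though their magnitudes, and hence their Euclidean and Manhattan distances, differ:

```python
import numpy as np
from sklearn.metrics.pairwise import (cosine_similarity,
                                      euclidean_distances,
                                      manhattan_distances)

# Two toy document vectors pointing in the same direction
a = np.array([[1.0, 0.0, 2.0]])
b = np.array([[2.0, 0.0, 4.0]])

print(cosine_similarity(a, b)[0, 0])    # 1.0: same direction, maximal similarity
print(euclidean_distances(a, b)[0, 0])  # ~2.236: sensitive to magnitude
print(manhattan_distances(a, b)[0, 0])  # 3.0: also sensitive to magnitude
```

This insensitivity to vector magnitude is one reason cosine similarity is popular for text: a long document and a short one about the same topic still score as similar.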

from sklearn.metrics.pairwise import linear_kernel

cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)
cosine_sim.shape
(45466, 45466)
cosine_sim[1]
array([0.01504121, 1.        , 0.04681953, ..., 0.        , 0.02198641,
       0.00929411])
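A side note on why `linear_kernel` (a plain dot product) gives us cosine similarity here: `TfidfVectorizer` L2-normalizes each row by default (`norm='l2'`), so the dot product of two unit-length rows already equals their cosine similarity, and we can skip the extra normalization that `cosine_similarity` would perform. A minimal sketch on a made-up three-document corpus:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel, cosine_similarity

# Tiny illustrative corpus (not from the MovieLens data)
docs = ["a toy story about toys",
        "old men arguing",
        "a story about old toys"]
X = TfidfVectorizer(stop_words='english').fit_transform(docs)

# Because each row is unit-length, the dot product is the cosine similarity
assert np.allclose(linear_kernel(X, X), cosine_similarity(X, X))
```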
# Map movie titles to their row indices in the metadata frame
indices = pd.Series(metadata.index, index=metadata['title']).drop_duplicates()

def get_recommendations(title, cosine_sim=cosine_sim):
    # Look up the row index of the query movie
    idx = indices[title]
    # Pair every movie index with its similarity score to the query movie
    sim_scores = list(enumerate(cosine_sim[idx]))
    # Sort by similarity, highest first
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    # Skip the first entry (the movie itself) and keep the top 10
    sim_scores = sim_scores[1:11]
    movie_indices = [i[0] for i in sim_scores]
    return metadata['title'].iloc[movie_indices]
get_recommendations('The Dark Knight Rises')
45464             Satan Triumphant
45463                     Betrayal
45462          Century of Birthing
45461                       Subdue
45460                   Robin Hood
45459              Caged Heat 3000
45458          The Burkittsville 7
45457    Shadow of the Blair Witch
45456             House of Horrors
45455    St. Michael Had a Rooster
Name: title, dtype: object
get_recommendations('The Godfather')
1178               The Godfather: Part II
44030    The Godfather Trilogy: 1972-1990
1914              The Godfather: Part III
23126                          Blood Ties
11297                    Household Saints
34717                   Start Liquidation
10821                            Election
38030            A Mother Should Be Loved
17729                   Short Sharp Shock
26293                  Beck 28 - Familjen
Name: title, dtype: object

Discussion

Here we will briefly discuss the motivation behind TF-IDF.

Term frequency Given a set of English text documents, we want to rank them by how relevant each one is to a query, for example, “the excellent student”. First, we can simply filter out the documents that do not contain all three words: “the”, “excellent” and “student”. However, there are still many documents left. To further distinguish them, we can count the frequency of those three words in each document and rank the documents by those counts. This count is called the term frequency. Since document lengths may vary significantly, we often normalize the frequency of each word by the length of the document.
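As a sketch of the idea above (the helper function and example sentence are made up for illustration), the normalized term frequency is just the term count divided by the document length:

```python
from collections import Counter

def term_frequency(document, term):
    # Count of the term, normalized by the number of tokens in the document
    tokens = document.lower().split()
    return Counter(tokens)[term] / len(tokens)

doc = "the excellent student thanked the excellent teacher"
print(term_frequency(doc, "the"))      # 2/7 ~ 0.2857
print(term_frequency(doc, "student"))  # 1/7 ~ 0.1429
```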

Inverse document frequency Some terms are more common than others. For example, the term “the” appears far more often than the word “excellent”. Term frequency therefore tends to incorrectly emphasize documents that happen to use the word “the” frequently, without giving enough weight to more meaningful terms such as “excellent” and “student”; yet “the” is not a good keyword for distinguishing relevant from non-relevant documents. The inverse document frequency is used to diminish the weight of terms that occur very frequently in the document set and increase the weight of terms that occur rarely.
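The textbook formulation of IDF is idf(t) = log(N / df(t)), where N is the number of documents and df(t) is the number of documents containing term t (scikit-learn actually uses a slightly smoothed variant). A minimal sketch with a made-up corpus:

```python
import math

def inverse_document_frequency(term, documents):
    # log(N / df): a term appearing in every document gets weight 0,
    # a rare term gets a large weight
    df = sum(1 for doc in documents if term in doc.lower().split())
    return math.log(len(documents) / df)

corpus = ["the excellent student",
          "the lazy dog",
          "the quick brown fox"]
print(inverse_document_frequency("the", corpus))        # log(3/3) = 0.0
print(inverse_document_frequency("excellent", corpus))  # log(3/1) ~ 1.0986
```

Multiplying the two quantities gives the TF-IDF weight: a term scores highly for a document only when it is frequent in that document and rare across the corpus.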